{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b03a4ef8",
   "metadata": {},
   "source": [
    "# AutoMM for Text - Multilingual Problems\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/text_prediction/multilingual_text.ipynb)\n",
    "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/text_prediction/multilingual_text.ipynb)\n",
    "\n",
    "\n",
    "\n",
    "People around the world speaks lots of languages. According to [SIL International](https://en.wikipedia.org/wiki/SIL_International)'s [Ethnologue: Languages of the World](https://en.wikipedia.org/wiki/Ethnologue), \n",
    "there are more than **7,100** spoken and signed languages. In fact, web data nowadays are highly multilingual and lots of \n",
    "real-world problems involve text written in languages other than English.\n",
    "\n",
    "In this tutorial, we introduce how `MultiModalPredictor` can help you build multilingual models. For the purpose of demonstration, \n",
    "we use the [Cross-Lingual Amazon Product Review Sentiment](https://webis.de/data/webis-cls-10.html) dataset, which \n",
    "comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese. \n",
    "We will demonstrate how to use AutoGluon Text to build sentiment classification models on the German fold of this dataset in two ways:\n",
    "\n",
    "- Finetune the German BERT\n",
    "- Cross-lingual transfer from English to German\n",
    "\n",
    "*Note:* You are recommended to also check [Single GPU Billion-scale Model Training via Parameter-Efficient Finetuning](../advanced_topics/efficient_finetuning_basic.ipynb) about how to achieve better performance via parameter-efficient finetuning. \n",
    "\n",
    "## Load Dataset\n",
    "\n",
    "The [Cross-Lingual Amazon Product Review Sentiment](https://webis.de/data/webis-cls-10.html) dataset contains Amazon product reviews in four languages. \n",
    "Here, we load the English and German fold of the dataset. In the label column, `0` means negative sentiment and `1` means positive sentiment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa00faab-252f-44c9-b8f7-57131aa8251c",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "!pip install autogluon.multimodal\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1568130d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget --quiet https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip\n",
    "!unzip -q -o amazon_review_sentiment_cross_lingual.zip -d ."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e6d8762",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',\n",
    "                          sep='\\t', header=None, names=['label', 'text']) \\\n",
    "                .sample(1000, random_state=123)\n",
    "train_de_df.reset_index(inplace=True, drop=True)\n",
    "\n",
    "test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',\n",
    "                          sep='\\t', header=None, names=['label', 'text']) \\\n",
    "               .sample(200, random_state=123)\n",
    "test_de_df.reset_index(inplace=True, drop=True)\n",
    "print(train_de_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb9af1b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',\n",
    "                          sep='\\t',\n",
    "                          header=None,\n",
    "                          names=['label', 'text']) \\\n",
    "                .sample(1000, random_state=123)\n",
    "train_en_df.reset_index(inplace=True, drop=True)\n",
    "\n",
    "test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',\n",
    "                          sep='\\t',\n",
    "                          header=None,\n",
    "                          names=['label', 'text']) \\\n",
    "               .sample(200, random_state=123)\n",
    "test_en_df.reset_index(inplace=True, drop=True)\n",
    "print(train_en_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd6e6f88",
   "metadata": {},
   "source": [
    "## Finetune the German BERT\n",
    "\n",
    "Our first approach is to finetune the [German BERT model](https://www.deepset.ai/german-bert) pretrained by deepset. \n",
    "Since `MultiModalPredictor` integrates with the [Huggingface/Transformers](https://huggingface.co/docs/transformers/index) (as explained in [Customize AutoMM](../advanced_topics/customization.ipynb)), \n",
    "we directly load the German BERT model available in Huggingface/Transformers, with the key as [bert-base-german-cased](https://huggingface.co/bert-base-german-cased). \n",
    "To simplify the experiment, we also just finetune for 4 epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7847c87d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.multimodal import MultiModalPredictor\n",
    "\n",
    "predictor = MultiModalPredictor(label='label')\n",
    "predictor.fit(train_de_df,\n",
    "              hyperparameters={\n",
    "                  'model.hf_text.checkpoint_name': 'bert-base-german-cased',\n",
    "                  'optim.max_epochs': 2\n",
    "              })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5c74da3e",
   "metadata": {},
   "outputs": [],
   "source": [
    "score = predictor.evaluate(test_de_df)\n",
    "print('Score on the German Testset:')\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b579571",
   "metadata": {},
   "outputs": [],
   "source": [
    "score = predictor.evaluate(test_en_df)\n",
    "print('Score on the English Testset:')\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ea2ef6b",
   "metadata": {},
   "source": [
    "We can find that the model can achieve good performance on the German dataset but performs poorly on the English dataset. \n",
    "Next, we will show how to enable cross-lingual transfer so you can get a model that can magically work for **both German and English**.\n",
    "\n",
    "## Cross-lingual Transfer\n",
    "\n",
    "In the real-world scenario, it is pretty common that you have trained a model for English and would like to extend the model to support other languages like German. \n",
    "This setting is also known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model to translate the sentences from the \n",
    "other language (e.g., German) to English and apply the English model.\n",
    "However, as showed in [\"Unsupervised Cross-lingual Representation Learning at Scale\"](https://arxiv.org/pdf/1911.02116.pdf), \n",
    "there is a better and cost-friendlier way for cross lingual transfer, enabled via large-scale multilingual pretraining.\n",
    "The author showed that via large-scale pretraining, the backbone (called XLM-R) is able to conduct *zero-shot* cross lingual transfer, \n",
    "meaning that you can directly apply the model trained in the English dataset to datasets in other languages. \n",
    "It also outperforms the baseline \"TRANSLATE-TEST\", meaning to translate the data from other languages to English and apply the English model. \n",
    "\n",
    "In AutoGluon, you can just turn on `presets=\"multilingual\"` in MultiModalPredictor to load a backbone that is suitable for zero-shot transfer. \n",
    "Internally, we will automatically use state-of-the-art models like [DeBERTa-V3](https://arxiv.org/abs/2111.09543)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64cfea80",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.multimodal import MultiModalPredictor\n",
    "\n",
    "predictor = MultiModalPredictor(label='label')\n",
    "predictor.fit(train_en_df,\n",
    "              presets='multilingual',\n",
    "              hyperparameters={\n",
    "                  'optim.max_epochs': 2\n",
    "              })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72e73dc9",
   "metadata": {},
   "outputs": [],
   "source": [
    "score_in_en = predictor.evaluate(test_en_df)\n",
    "print('Score in the English Testset:')\n",
    "print(score_in_en)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7073e87",
   "metadata": {},
   "outputs": [],
   "source": [
    "score_in_de = predictor.evaluate(test_de_df)\n",
    "print('Score in the German Testset:')\n",
    "print(score_in_de)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e057696",
   "metadata": {},
   "source": [
    "We can see that the model works for both German and English!\n",
    "\n",
    "Let's also inspect the model's performance on Japanese:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f79a6f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',\n",
    "                          sep='\\t', header=None, names=['label', 'text']) \\\n",
    "               .sample(200, random_state=123)\n",
    "test_jp_df.reset_index(inplace=True, drop=True)\n",
    "print(test_jp_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2f64266",
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Negative labe ratio of the Japanese Testset=', test_jp_df['label'].value_counts()[0] / len(test_jp_df))\n",
    "score_in_jp = predictor.evaluate(test_jp_df)\n",
    "print('Score in the Japanese Testset:')\n",
    "print(score_in_jp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4ba2df9",
   "metadata": {},
   "source": [
    "Amazingly, the model also works for Japanese!\n",
    "\n",
    "## Other Examples\n",
    "\n",
    "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore other examples about AutoMM.\n",
    "\n",
    "## Customization\n",
    "To learn how to customize AutoMM, please refer to [Customize AutoMM](../advanced_topics/customization.ipynb)."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}